Abstract

We show that Closest Substring, one of the most important problems in the field of biological
sequence analysis, is W[1]-hard when parameterized by the number $k$ of input strings (and
remains so, even over a binary alphabet). This problem is therefore unlikely to be solvable in
time $O(f(k) \cdot n^{c})$ for any function $f$ of $k$ and constant $c$ independent of $k$. The problem can
therefore be expected to be intractable, in any practical sense, for $k \ge 3$. Our result supports
the intuition that Closest Substring is computationally much harder than the special case of
Closest String, although both problems are NP-complete. We also prove W[1]-hardness for
other parameterizations in the case of unbounded alphabet size. Our W[1]-hardness result for
Closest Substring generalizes to Consensus Patterns, a problem of similar significance
in computational biology.



1 Introduction



Motif search problems are of central importance for sequence analysis in computational molecular
biology. These problems have applications in fields such as genetic drug target identification or
signal finding (see, e.g., [5] and the references cited therein for more details and further
applications). Two core problems in this context are Closest Substring [21] and Consensus
Patterns:

Input: $k$ strings $s_1, s_2, \ldots, s_k$ over alphabet $\Sigma$, and non-negative integers $d$ and $L$.

Question in case of Closest Substring: Is there a string $s$ of length $L$, and,
for $i = 1, \ldots, k$, a substring $s'_i$ of $s_i$ of length $L$ such that, for all $i = 1, \ldots, k$, $d_H(s, s'_i) \le d$?
(Here $d_H(s, s'_i)$ denotes the Hamming distance between $s$ and $s'_i$.)

Question in case of Consensus Patterns: Is there a string $s$ of length $L$, and,
for $i = 1, \ldots, k$, a substring $s'_i$ of $s_i$ of length $L$ such that $\sum_{i=1}^{k} d_H(s, s'_i) \le d$?
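The two questions differ only in how the per-string distances are aggregated. The following is a minimal sketch, assuming a candidate string `s` is already given; the function names are illustrative and do not come from the paper. It verifies a candidate against both definitions by sliding a window of length $L$ over each input string.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_distance(s, t):
    """Smallest Hamming distance between s and any length-|s| substring of t."""
    L = len(s)
    return min(hamming(s, t[p:p + L]) for p in range(len(t) - L + 1))

def is_closest_substring(s, strings, d):
    """Closest Substring: the maximum of the per-string distances is at most d."""
    return max(best_distance(s, t) for t in strings) <= d

def is_consensus_pattern(s, strings, d):
    """Consensus Patterns: the sum of the per-string distances is at most d."""
    return sum(best_distance(s, t) for t in strings) <= d
```

For the inputs `["aab", "abb"]` and the candidate `"ba"`, each input string has a best window at distance 1, so the candidate is accepted for Closest Substring with $d = 1$ but needs $d = 2$ for Consensus Patterns.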

What is currently known about these two problems is summarized as follows. 
The Closest Substring Problem. 



1. Closest Substring is NP-complete, and remains so for the special case of the Closest
String problem, where the string s that we search for is of the same length as the input strings.
Closest String is NP-complete even for the further restriction to a binary alphabet [18].

2. On the positive side, both Closest Substring and Closest String admit polynomial
time approximation schemes (PTAS's), where the objective function is the maximum length
of the string s [19, 21].



3. In the PTAS's for both Closest String and Closest Substring, the exponent of the
polynomial bounding the running time depends on the goodness of the approximation. These
are not efficient PTAS's (EPTAS's) in the sense of [?] and therefore are probably not useful
for bioinformatics practice. Whether EPTAS's are possible for these approximation problems,
or whether they are W[1]-hard (for the parameter $k = 1/\epsilon$, where the approximation is to
within a factor of $(1 + \epsilon)$ of optimal), currently remains open.

4. Closest String is fixed-parameter tractable with respect to the parameter $d$, and can be
solved in time $O(kL + kd \cdot d^{d})$ [16].



5. Closest String is also fixed-parameter tractable with respect to the parameter $k$, but here
the exponential parametric function is much faster growing, and the algorithm is probably
of less practical use (see, however, [15] for some encouraging experimental results also in this
case).



The Consensus Patterns Problem. 



1. Consensus Patterns is NP-complete and remains so for the restriction to a binary
alphabet [11].



2. Consensus Patterns admits a PTAS [19], where the objective function is the maximum
length of the string s.

3. The known PTAS's for Consensus Patterns are not EPTAS's, and whether EPTAS's are
possible, or whether PTAS approximation for this objective function is W[1]-hard, is an
important issue that also currently remains open.

[Table 1: body not recoverable; surviving cell entries: "W[1]-hard (*)", "d, k, L", "FPT".]

Table 1: Overview on the parameterized complexity of Closest Substring and Consensus
Patterns with respect to different parameterizations, where $k$ is the number of given strings, $L$
is the length of the substrings we search for, and $d$ is the Hamming distance allowed. Results from
this paper are marked by (*). The FPT results for constant size alphabet can be achieved by
enumerating all length-$L$ strings over $\Sigma$. Open questions are indicated by a question mark.
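The enumeration argument mentioned in the caption can be sketched as follows. This is an illustrative brute-force solver, not an algorithm taken from the paper; the function name and the choice to return the first solution found are assumptions. It tries all $|\Sigma|^L$ candidates, so for a constant-size alphabet the running time is exponential only in the parameter $L$.

```python
from itertools import product

def solve_closest_substring(strings, L, d, alphabet):
    """Try every length-L string over `alphabet`; return the first one whose
    best window in each input string is within Hamming distance d, else None."""
    def best_distance(s, t):
        return min(sum(a != b for a, b in zip(s, t[p:p + L]))
                   for p in range(len(t) - L + 1))

    for candidate in product(alphabet, repeat=L):  # |alphabet|**L candidates
        s = "".join(candidate)
        if all(best_distance(s, t) <= d for t in strings):
            return s
    return None
```

Replacing `all(... <= d ...)` by a comparison of the summed distances against `d` gives the analogous enumeration for Consensus Patterns.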



The key distinguishing point between Closest Substring and Consensus Patterns lies in 
the definition of the distance measure d between the "solution" string s and the substrings of 
the k input strings. Whereas Closest Substring uses a maximum distance metric, Consensus
Patterns uses the sum of distances metric. This is of particular importance when discussing 
values of parameter d occurring in practice. Whereas it makes good sense for many applications to 
assume that d is a fairly small number in case of Closest Substring, this is much less reasonable 
in the case of Consensus Patterns. This will be of some importance when discussing our result 
for Consensus Patterns. 



Many algorithms applied in practice try to solve motif search problems exactly, often using enumerative
approaches in combination with heuristics (see, e.g., [2]). In this paper, we explore the parameterized
complexity of the basic motif problems in the framework of [10].



Our Main Results. 



Unfortunately, our main results are negative ones: we show that Closest Substring and
Consensus Patterns are W[1]-hard with respect to the parameter $k$, the number of input strings,
even in case of a binary alphabet.

For unbounded alphabet size, we show that the problems are W[1]-hard for the combined
parameters $L$, $d$, and $k$. In the case of constant alphabet size, the complexity of the problems remains
open when parameterized by $d$ and $k$ together, or by $d$ alone. Note that in the case of Consensus
Patterns our result gains particular importance, because here the distance parameter $d$ usually is
not small, whereas assuming that $k$ is small is reasonable. Until now, it was known only that if one
additionally considers the substring length $L$ as a parameter, then running times exponential in $L$
can be achieved [11, 28]. An overview on known parameterized complexity results for Closest
Substring and Consensus Patterns is given in Table 1.

We achieve our results by giving parameterized many-one reductions from the W[1]-complete
Clique problem to the respective problems. It is important here to note that parameterized
reductions are much more fine-grained than conventional polynomial time reductions used in
NP-completeness proofs, since parameterized reductions have to take care of the parameters.
Establishing that Closest Substring and Consensus Patterns are W[1]-hard with respect to the
parameter $k$ requires significantly more technical effort than the already known demonstrations of
NP-completeness. Finally, our work gives strong theory-based support for the common intuition
that Closest Substring (W[1]-hard) seems to be a much harder problem than Closest String
(in FPT [16]). Notably, this could not be expressed by "classical complexity measures," since both
problems are NP-complete and both have a PTAS.

Our work is organized as follows. In Section 2, we provide some background on parameterized
complexity theory and we give a brief overview on related computational biology results. Afterwards,
in Section 3, we present a parameterized reduction of Clique to Closest Substring in case of
unbounded input alphabet size. Then, in Section 4, this is specialized to the case of binary input
alphabet. Finally, Section 5 gives similar constructions and results for Consensus Patterns and
the paper concludes with a brief summary and open questions in Section 6.



2 Preliminaries and Previous Work 

In this section, we start with a brief introduction to parameterized complexity (more details can
be found in the monograph [10] and in recent survey articles).



2.1 A Crash Course in Parameterized Complexity 

Given an undirected graph $G = (V, E)$ with vertex set $V$, edge set $E$, and a positive integer $k$,
the NP-complete Vertex Cover problem is to determine whether there is a subset of vertices
$C \subseteq V$ with $k$ or fewer vertices such that each edge in $E$ has at least one of its endpoints in $C$.
Vertex Cover is fixed-parameter tractable. There now are algorithms solving it in time less than
$O(kn + 1.3^{k})$. The corresponding complexity class is called FPT. By way of contrast,
consider the NP-complete Clique problem: Given an undirected graph $G = (V, E)$ and a positive
integer $k$, Clique asks whether there is a subset of vertices $C \subseteq V$ with at least $k$ vertices such
that $C$ forms a clique by having all possible edges between the vertices in $C$. Clique appears to
be fixed-parameter intractable: It is not known whether it can be solved in time $f(k) \cdot n^{O(1)}$, where
$f$ might be an arbitrarily fast growing function only depending on $k$.

The best known algorithm solving Clique runs in time $O(n^{ck/3})$, where $c$ is the exponent in
the time bound for multiplying two integer $n \times n$ matrices (currently best known, $c = 2.38$, see [8]).
The decisive point is that $k$ appears in the exponent of $n$, and there seems to be no way "to shift
the combinatorial explosion only into $k$," independent from $n$.
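The contrast between the two problems can be made concrete with a small sketch. The code below is illustrative only and is not one of the algorithms cited above: `vertex_cover` is the textbook bounded search tree with about $2^k$ branches (not the faster $1.3^k$-type algorithms), and `has_clique` is the naive enumeration of all $k$-subsets, in which $k$ ends up in the exponent of $n$.

```python
from itertools import combinations

def vertex_cover(edges, k):
    """Bounded search tree: pick an uncovered edge and branch on its endpoints.
    The recursion depth is at most k, giving at most 2**k branches."""
    if not edges:
        return set()
    if k == 0:
        return None
    u, v = edges[0]
    for w in (u, v):
        remaining = [e for e in edges if w not in e]
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {w}
    return None

def has_clique(vertices, edges, k):
    """Naive test: inspect all k-subsets of the vertices, roughly n**k of them."""
    adjacent = {frozenset(e) for e in edges}
    return any(all(frozenset(pair) in adjacent for pair in combinations(c, 2))
               for c in combinations(vertices, k))
```

In `vertex_cover` the exponential part depends on $k$ alone, while `has_clique` has no such separation between $k$ and the input size.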

Downey and Fellows developed a completeness program for showing parameterized
intractability [10]. However, the completeness theory of parameterized intractability involves significantly
more technical effort (as will also become clear when following the proofs presented in this paper). 
We very briefly sketch some integral parts of this theory in the following. 

Let $L, L' \subseteq \Sigma^* \times \mathbb{N}$ be two parameterized languages.^1 For example, in the case of Clique, the
first component is the input graph coded over some alphabet $\Sigma$ and the second component is the
positive integer $k$, that is, the parameter. We say that $L$ reduces to $L'$ by a standard parameterized
m-reduction if there are functions $k \mapsto k'$ and $k \mapsto k''$ from $\mathbb{N}$ to $\mathbb{N}$ and a function $(x, k) \mapsto x'$ from
$\Sigma^* \times \mathbb{N}$ to $\Sigma^*$ such that

1. $(x, k) \mapsto x'$ is computable in time $k'' |x|^{c}$ for some constant $c$ and

2. $(x, k) \in L$ iff $(x', k') \in L'$.

^1 In general, the second component (representing the parameter) can also be drawn from $\Sigma^*$; for most cases, and,
in particular, in this paper, assuming the parameter to be a positive integer is sufficient.

Notably, most reductions from classical complexity turn out not to be parameterized ones. The
basic reference degree for parameterized intractability, W[1], can be defined as the class of
parameterized languages that are equivalent to the Short Turing Machine Acceptance problem (also
known as the $k$-Step Halting problem). Here, we want to determine, for an input consisting of
a nondeterministic Turing machine $M$ (with unbounded nondeterminism and alphabet size), and
a string $x$, whether $M$ has a computation path accepting $x$ in at most $k$ steps. This can trivially
be solved in time $O(n^{k+1})$ by exploring all $k$-step computation paths exhaustively, and we would
be surprised if this can be much improved.

Therefore, this is the parameterized analogue of the Turing Machine Acceptance problem that
is the basic generic NP-complete problem in classical complexity theory, and the conjecture that
FPT $\ne$ W[1] is very much analogous to the conjecture that P $\ne$ NP. Other problems that are
W[1]-complete (there are many) include Clique and Independent Set, where the parameter is
the size of the relevant vertex set.

From a practical point of view, W[1]-hardness gives a concrete indication that a parameterized
problem with parameter $k$ is unlikely to allow for an algorithm with a running time of the
form $f(k) \cdot n^{O(1)}$.



2.2 Motivation and Previous Results 



Many biological problems with respect to DNA, RNA, or protein sequences can be solved based
on consensus word analysis [?, Section 8.6]; Closest Substring and Consensus Patterns are
central problems in this context [14]. Applications include locating binding sites and
finding conserved regions in unaligned sequences for genetic drug target identification, for designing
genetic probes, and for universal PCR primer design. These problems can be regarded as various
generalizations of the common substring problem, allowing errors (see [2, 21] and references
there). This leads to Closest Substring and Consensus Patterns, where errors are modeled
by the (Hamming) distance parameter $d$.



There is a straightforward factor-2 approximation algorithm for Closest Substring. The first
better-than-2 approximation, with factor $2 - 2/(2|\Sigma| + 1)$, was given by Li et al. [19]. Finally, there
are PTASs for Consensus Patterns [19, 20] as well as for Closest Substring [21], both of
which, however, have impractical running times.



Concerning exact (parameterized) algorithms, we only briefly mention that, e.g., Sagot [28] studies
motif discovery by solving Closest Substring, Evans and Wareham [11] give FPT algorithms for
the same problem, and Blanchette et al. developed a so-called phylogenetic footprinting method
for a slightly more general version of Consensus Patterns. All these results, however, make
essential use of the parameter "substring length" $L$ and the running times show exponential behavior
with respect to $L$. To circumvent the computational limitations for larger values of $L$, many
heuristics were proposed, e.g., Pevzner and Sze [26] present algorithms called WINNOWER (for
Closest Substring) and SP-STAR (for Consensus Patterns), and Buhler and Tompa
use random projections to find closest substrings. Our analysis makes a first step towards showing
that, for exact solutions, we have to include $L$ in the exponential growth; namely, we show that it
is highly unlikely to find algorithms with a running time exponential only in $k$.



3 Closest Substring: Unbounded Alphabet 

We first describe a reduction from the W[1]-hard Clique problem to Closest Substring which is
a parameterized m-reduction with respect to the aggregate parameter $(L, d, k)$ in case of unbounded
alphabet size.



3.1 Reduction of Clique to Closest Substring 

A Clique instance is given by an undirected graph $G = (V, E)$, with a set $V = \{v_1, v_2, \ldots, v_n\}$ of
$n$ vertices, a set $E$ of $m$ edges, and a positive integer $k$ denoting the desired clique size. We describe
how to generate a set $S$ of $\binom{k}{2}$ strings such that $G$ has a clique of size $k$ iff there is a string $s$ of
length $L := k + 1$ such that every $s_i \in S$ has a substring $s'_i$ of length $L$ with $d_H(s, s'_i) \le d := k - 2$.
If a string $s_i \in S$ has a substring $s'_i$ of length $L$ with $d_H(s, s'_i) \le d$, we call $s'_i$ a match. We assume
$k > 2$, because $k = 1, 2$ are trivial cases.

Alphabet. The alphabet of the produced instance is given by the disjoint union of the following
sets:

• $\{\sigma_i \mid v_i \in V\}$, i.e., an alphabet symbol for every vertex of the input graph; we call them
encoding symbols;

• $\{\varphi_j \mid j = 1, \ldots, \binom{k}{2}\}$, i.e., a unique symbol for each of the $\binom{k}{2}$ produced strings; we call
them string identification symbols;

• $\{\#\}$, which we call the synchronizing symbol.

This makes a total of $n + \binom{k}{2} + 1$ alphabet symbols.

Choice strings. We generate a set of $\binom{k}{2}$ choice strings $S_c = \{c_{1,2}, \ldots, c_{1,k}, c_{2,3}, c_{2,4}, \ldots, c_{k-1,k}\}$
and we assume that the strings in $S_c$ are ordered as shown. Every choice string will encode the
whole graph; it consists of $m$ concatenated strings, each of length $k + 1$, called blocks; by this, we
have one block for every edge of the graph. The blocks will be separated by barriers, which are
length-$k$ strings consisting of $k$ identification symbols corresponding to the respective string. A
choice string $c_{i,j}$, which, according to the given order, is the $i'$th choice string in $S_c$, is given by

$c_{i,j} := \langle\mathrm{block}(i, j, e_1)\rangle\, (\varphi_{i'})^{k}\, \langle\mathrm{block}(i, j, e_2)\rangle\, (\varphi_{i'})^{k} \cdots (\varphi_{i'})^{k}\, \langle\mathrm{block}(i, j, e_m)\rangle,$

where $e_1, e_2, \ldots, e_m$ are the edges of $G$ and $\langle\mathrm{block}(\cdot)\rangle$ will be defined below. The solution string $s$
will have length $k + 1$, which is exactly the length of one block.

Figure 1: Example for the reduction from a Clique instance $G$ with $k = 3$ (shown in (a)) to a
Closest Substring instance with unbounded alphabet (shown in (b)) as explained in Example 1.
In (b), we display the constructed strings $c_1$, $c_2$, and $c_3$ (the contained blocks are highlighted by
bold boxes) and the solution string $s$ that is found, since $G$ has a clique of size $k = 3$; $s$ is a string
of length $k + 1 = 4$ such that $c_1$, $c_2$, and $c_3$ have length-4 substrings (indicated by dashed boxes)
that have Hamming distance at most $k - 2 = 1$ to $s$.

Block in a choice string. Every block is a string of length $k + 1$ and it encodes an edge of
the input graph. Every choice string contains a block for every edge of the input graph; different
choice strings, however, encode the edges in different positions of their blocks: For a block in choice
string $c_{i,j}$, positions $i$ and $j$ are called active and these positions encode the edge. Let $e$ be the edge
to be encoded and let $e$ connect vertices $v_r$ and $v_s$, $1 \le r < s \le n$. Then, the $i$th position of the
block is $\sigma_r$ in order to encode $v_r$ and the $j$th position is $\sigma_s$ in order to encode $v_s$. The last position
of a block is set to the synchronizing symbol #. Let $c_{i,j}$ be the $i'$th choice string in $S_c$; then, all
remaining positions in the block are set to $c_{i,j}$'s identification symbol $\varphi_{i'}$. Thus, the block is given
by

$\langle\mathrm{block}(i, j, (v_r, v_s))\rangle := (\varphi_{i'})^{i-1}\, \sigma_r\, (\varphi_{i'})^{j-i-1}\, \sigma_s\, (\varphi_{i'})^{k-j}\, \#.$

Values for L and d. We set $L := k + 1$ and $d := k - 2$.
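The construction can be sketched in a few lines. The following is a minimal illustration, assuming vertices are numbered $1, \ldots, n$, edges are given as pairs $(r, s)$ with $r < s$, and symbols are represented by short strings such as `"s3"` for $\sigma_3$ and `"p2"` for $\varphi_2$; these representation choices and the function names are not part of the paper.

```python
from itertools import combinations

def block(i, j, r, s, k, ident):
    """Block of length k+1: sigma_r at position i, sigma_s at position j,
    '#' at the last position, the identification symbol everywhere else."""
    return ([ident] * (i - 1) + [f"s{r}"] + [ident] * (j - i - 1)
            + [f"s{s}"] + [ident] * (k - j) + ["#"])

def choice_strings(edges, k):
    """Choice strings c_{i,j} in the order c_{1,2}, ..., c_{1,k}, c_{2,3}, ...
    Each is a list of symbols: one block per edge, separated by barriers."""
    result = {}
    pairs = combinations(range(1, k + 1), 2)
    for index, (i, j) in enumerate(pairs, start=1):
        ident = f"p{index}"            # identification symbol phi_{i'}
        barrier = [ident] * k          # barrier (phi_{i'})^k
        string = []
        for pos, (r, s) in enumerate(edges):
            if pos > 0:
                string += barrier
            string += block(i, j, r, s, k, ident)
        result[(i, j)] = string
    return result

def has_match(s, t, d):
    """True if some length-|s| window of t has Hamming distance at most d to s."""
    L = len(s)
    return any(sum(a != b for a, b in zip(s, t[p:p + L])) <= d
               for p in range(len(t) - L + 1))
```

Each choice string has length $m(k + 1) + (m - 1)k$, and a candidate solution is checked with `has_match(s, c, k - 2)` for every choice string `c`.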

Example 1. Let $G = (V, E)$ be an undirected graph with $V = \{v_1, v_2, v_3, v_4\}$ and $E = \{(v_1, v_3),
(v_1, v_4), (v_2, v_3), (v_3, v_4)\}$ (as shown in Fig. 1(a)) and let $k = 3$. Using $G$, we exhibit the above
construction of $\binom{k}{2} = 3$ choice strings $c_1$, $c_2$, and $c_3$ (as shown in Fig. 1(b)). Note that, in the described
construction, the strings were called $c_{1,2}$, $c_{1,3}$, and $c_{2,3}$, but, here, for the ease of presentation, we
call them $c_1$, $c_2$, and $c_3$. We claim (and prove in the following subsection) that there
exists a clique of size $k$ in $G$ iff there is a string $s$ of length $L := k + 1 = 4$ such that, for $i = 1, 2, 3$,
each $c_i$ contains a length-4 substring $s_i$ with $d_H(s, s_i) \le d := k - 2 = 1$.

The choice strings are over an alphabet consisting of $\{\sigma_1, \sigma_2, \sigma_3, \sigma_4\}$ (the encoding symbols, i.e.,
one symbol for every node of $G$), $\{\varphi_1, \varphi_2, \varphi_3\}$ (the string identification symbols), and $\{\#\}$ (the
synchronizing symbol). Every string $c_i$, $i = 1, 2, 3$, consists of four blocks, each of which encodes an
edge of the graph. Every block is of length $k + 1 = 4$ and has # at its last position. The blocks
are separated by barriers consisting of $(\varphi_i)^{k} = (\varphi_i)^{3}$.

In string $c_1$, positions 1 and 2 within a block are active and encode the corresponding edge (in $c_2$,
positions 1 and 3, and, in $c_3$, positions 2 and 3 within a block are active). All of the first $k$ positions
of a block in string $c_i$, $i = 1, 2, 3$, which are not active contain the $\varphi_i$ symbol. Thus, e.g., the block
in $c_1$ encoding the edge $(v_1, v_3)$ is given by $\sigma_1 \sigma_3 \varphi_1 \#$. Further details can be found in Fig. 1.

The closest substring that corresponds to the $k$-clique in $G$ consisting of vertices $v_1$, $v_3$, and $v_4$
is $\sigma_1 \sigma_3 \sigma_4 \#$. The corresponding matches are $\sigma_1 \sigma_3 \varphi_1 \#$ in $c_1$ (encoding the edge $(v_1, v_3)$), $\sigma_1 \varphi_2 \sigma_4 \#$
in $c_2$ (encoding the edge $(v_1, v_4)$), and $\varphi_3 \sigma_3 \sigma_4 \#$ in $c_3$ (encoding the edge $(v_3, v_4)$).



3.2 Correctness of the Reduction 



To prove the correctness of the proposed reduction, we have to show an equivalence, consisting of
two directions. The easier one is to see that a $k$-clique implies a closest substring fulfilling the given
requirements.

Proposition 1. For a graph with a $k$-clique, the construction in Subsection 3.1 produces an
instance of Closest Substring which has a solution, i.e., there is a string $s$ of length $L$ such that
every $c_{i,j} \in S_c$ has a substring $s_{i,j}$ with $d_H(s, s_{i,j}) \le d$.



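Proof. Let the input graph have a clique of size $k$. Let $h_1, h_2, \ldots, h_k$ denote the indices of the
clique's vertices, $1 \le h_1 < h_2 < \ldots < h_k \le n$. Then, we claim that a solution for the produced
Closest Substring instance is

$s := \sigma_{h_1} \sigma_{h_2} \cdots \sigma_{h_k} \#.$

Consider choice string $c_{i,j}$, $1 \le i < j \le k$. As the vertices $v_{h_1}, v_{h_2}, \ldots, v_{h_k}$ form a clique, we have
an edge connecting $v_{h_i}$ and $v_{h_j}$. Choice string $c_{i,j}$ contains a block $s_{i,j} := \langle\mathrm{block}(i, j, (v_{h_i}, v_{h_j}))\rangle$
encoding this edge:

$s_{i,j} = (\varphi_{i'})^{i-1}\, \sigma_{h_i}\, (\varphi_{i'})^{j-i-1}\, \sigma_{h_j}\, (\varphi_{i'})^{k-j}\, \#,$

where $i'$ is the number (according to the given order) of the choice string in $S_c$. The strings $s$ and $s_{i,j}$
agree at positions $i$, $j$, and $k + 1$ and differ at the remaining $k - 2$ positions, so $d_H(s, s_{i,j}) = k - 2$,
and we can find such a block for every $c_{i,j}$, $1 \le i < j \le k$. □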



For the reverse direction, we show in Proposition 2 that a solution in the produced Closest
Substring instance implies a $k$-clique in the input graph. For this, we need the following two
lemmas, which show that a solution to the instance constructed in Subsection 3.1 has encoding
symbols at its first $k$ positions and the synchronizing symbol # at its last position.

Lemma 1. A closest substring $s$ contains at least two encoding symbols and at least one
synchronizing symbol.



Proof. Let $s$ be a solution of the Closest Substring instance produced by the construction in
Subsection 3.1. Let $A_\varphi(s)$ be the set of string identification symbols from $\{\varphi_i \mid 1 \le i \le \binom{k}{2}\}$ that
occur in $s$. Let $S_\varphi(s) \subseteq S_c$ be the subset of choice strings that do not contain a symbol from $A_\varphi(s)$.
Since $s$ is of length $k + 1$, we have $|A_\varphi(s)| \le k + 1$. Therefore, for $k \ge 4$, there are at least $\binom{k}{2} - (k + 1)$
choice strings in $S_\varphi(s)$. We show that with less than two encoding symbols and no synchronizing
symbol, we cannot find matches for $s$ (with maximally allowed Hamming distance $d = k - 2$) in
the choice strings of $S_\varphi(s)$. Observe that, in every choice string, because of the barriers, every
length-$(k + 1)$ substring contains at most two encoding symbols and at most one symbol #. Observe
further that, taking a choice string from $S_\varphi(s)$, positions with symbols from $\{\varphi_i \mid 1 \le i \le \binom{k}{2}\}$
cannot coincide with the corresponding positions in $s$. Therefore, $s$ has a match in such a string only
if $s$ has two encoding symbols and one symbol # that all coincide with the corresponding positions
in the selected substring. This proves the claim for $k \ge 4$. Regarding $k = 3$, if $|A_\varphi(s)| < 3$, then
the above argument applies here, too. If, however, $|A_\varphi(s)| = 3$, a length-4 substring in every choice
string has at least two positions that do not coincide with the corresponding positions in $s$. □



Based on Lemma 1, we can now exactly specify the numbers and positions of the encoding and
synchronizing symbols in the closest substring.

Lemma 2. A closest substring $s$ contains encoding symbols at its first $k$ positions and a symbol #
at its last position.



Proof. Let $n_\#(s)$ denote the number of symbols # in $s$, let $n_\varphi(s)$ denote the number of string
identification symbols in $s$, and let $n_\sigma(s)$ denote the number of encoding symbols in $s$. Let $S_\varphi(s) \subseteq
S_c$ be the subset of choice strings whose string identification symbol does not occur in $s$. In the
following, we establish a lower bound on the number of strings in $S_\varphi(s)$ and an upper bound on
the number of strings from $S_\varphi(s)$ in which we can find a match for $s$. Comparing these bounds,
we will show that, if $n_\#(s) > 1$, then there are choice strings in $S_\varphi(s)$ in which we cannot find a
match; we will conclude that $n_\#(s) = 1$. Then, we will show that, if $n_\sigma(s) < k$, then again there
are strings in $S_\varphi(s)$ without a match; we will conclude that $n_\sigma(s) = k$.

Regarding the size of $S_\varphi(s)$, a lower bound on its size is $|S_\varphi(s)| \ge \binom{k}{2} - n_\varphi(s)$. To explain the
upper bound on the number of strings from $S_\varphi(s)$ in which we can find a match for $s$, we recall
that such matches must contain two encoding symbols and one symbol # that all coincide with the
corresponding positions in $s$. On the one hand, the synchronizing symbol of a block must coincide
with a symbol # in $s$. On the other hand, in all blocks of a choice string, its encoding symbols
are in fixed positions relative to the block's synchronizing symbol, e.g., in choice string $c_{1,2}$, the
encoding symbols are located only at the first and second position and # at the last position of
a block in $c_{1,2}$. For these two reasons, one symbol # in $s$ can provide matches in at most $\binom{n_\sigma(s)}{2}$
choice strings from $S_\varphi(s)$. Consequently, $n_\#(s)$ many symbols # in $s$ can provide matches in at
most $n_\#(s) \cdot \binom{n_\sigma(s)}{2}$ choice strings from $S_\varphi(s)$.

Summarizing, we have at least $\binom{k}{2} - n_\varphi(s)$ choice strings in $S_\varphi(s)$ and we can find matches in at
most $n_\#(s) \cdot \binom{n_\sigma(s)}{2}$ many of them. Thus, we find matches for $s$ in all choice strings only if

$n_\#(s) \cdot \binom{n_\sigma(s)}{2} \ge \binom{k}{2} - n_\varphi(s). \qquad (1)$

In order to show that $s$ contains exactly one synchronizing symbol, we assume that $n_\#(s) > 1$ (we
know that $n_\#(s) \ge 1$ by Lemma 1) while $k > 2$, and show that inequality (1) is violated.

We know that $k + 1 = n_\sigma(s) + n_\varphi(s) + n_\#(s)$ and, by Lemma 1, that $n_\sigma(s) \ge 2$. Using these, we
conclude, on the one hand, that $n_\#(s) \cdot \binom{n_\sigma(s)}{2} \le n_\#(s) \cdot \binom{k+1-n_\#(s)}{2}$ and, since $n_\#(s) \ge 2$, that
$n_\#(s) \cdot \binom{k+1-n_\#(s)}{2} \le 2 \cdot \binom{k-1}{2}$. On the other hand, we have that $\binom{k}{2} - n_\varphi(s) \ge \binom{k}{2} - (k - 1 - n_\#(s))$
and, since $n_\#(s) \ge 2$, $\binom{k}{2} - (k - 1 - n_\#(s)) \ge \binom{k}{2} - (k - 3)$. For $k \ge 3$, however, we have
$\binom{k}{2} - (k - 3) > 2 \cdot \binom{k-1}{2}$. Thus,

$n_\#(s) \cdot \binom{n_\sigma(s)}{2} \le n_\#(s) \cdot \binom{k+1-n_\#(s)}{2} < \binom{k}{2} - (k - 1 - n_\#(s)) \le \binom{k}{2} - n_\varphi(s),$

i.e., there are choice strings in $S_\varphi(s)$ which contain no match for $s$, a contradiction. Since (Lemma 1)
$n_\#(s) \ge 1$, we conclude that $n_\#(s) = 1$.

In order to show that $s$ contains exactly $k$ encoding symbols, we assume that $n_\sigma(s) < k$ while $k > 2$
and $n_\#(s) = 1$, and show that inequality (1) is violated. Since $k + 1 = n_\sigma(s) + n_\varphi(s) + n_\#(s) =
n_\sigma(s) + n_\varphi(s) + 1$, we have $\binom{k}{2} - n_\varphi(s) = \binom{k}{2} - (k - n_\sigma(s))$ and, thus,

$\binom{n_\sigma(s)}{2} < \binom{k}{2} - (k - n_\sigma(s)) = \binom{k}{2} - n_\varphi(s),$

i.e., again, some strings in $S_\varphi(s)$ have no match for $s$, a contradiction. Thus, on the one hand, we
have $n_\sigma(s) \ge k$, and, on the other hand, we have $n_\#(s) = 1$ and, therefore, $n_\sigma(s) \le k$.

Note that, if an encoding symbol is located after the synchronizing symbol in $s$, then, due to the
barriers, it is not possible that both # and this encoding symbol coincide with the respective
positions in a choice string from $S_\varphi(s)$. Therefore, symbol # is located at the last position of $s$. □



Proposition 2. The first k characters of a closest substring correspond to k vertices of a clique in 
the input graph. 



Proof. By Lemma 2, a closest substring $s$ has encoding symbols at its first $k$ positions and a
synchronizing symbol at its last position. Consequently, the blocks are the only possible matches
of $s$ in the choice string. Now, assume that $s = \sigma_{h_1} \sigma_{h_2} \cdots \sigma_{h_k} \#$ for $h_1, h_2, \ldots, h_k \in \{1, \ldots, n\}$.
Consider any two $h_i, h_j$, $1 \le i < j \le k$, and choice string $c_{i,j}$. Recall that in this choice string,
the blocks encode edges at their $i$th and $j$th position, they have # at their last position, and all
their other positions are set to a string identification symbol unique for this choice string. Thus,
we can only find a block that is a match if there is a block with $\sigma_{h_i}$ at its $i$th position and $\sigma_{h_j}$ at its
$j$th position. We have such a block only if there is an edge connecting $v_{h_i}$ and $v_{h_j}$. Summarizing,
the closest substring $s$ implies that there is an edge between every pair of $\{v_{h_1}, v_{h_2}, \ldots, v_{h_k}\}$; these
vertices form a $k$-clique in the input graph. □

Propositions 1 and 2 establish the following hardness result. Note that hardness for the combination
of all three parameters also implies hardness for each subset of the three.

Theorem 1. Closest Substring with unbounded alphabet is W[1]-hard for every combination
of the parameters $L$, $d$, and $k$.
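For the small graph of Example 1, the equivalence established by Propositions 1 and 2 can be checked exhaustively. The sketch below is an illustration rather than part of the proof: it rebuilds the choice strings (symbols are modeled as tuples such as `("sigma", 3)`, an assumption of this sketch) and enumerates all $8^4 = 4096$ candidate strings over the produced alphabet.

```python
from itertools import combinations, product

def choice_strings(edges, k):
    """Choice strings of Subsection 3.1 as lists of symbols (edges as r < s)."""
    result = []
    pairs = combinations(range(1, k + 1), 2)
    for index, (i, j) in enumerate(pairs, start=1):
        phi = ("phi", index)
        string = []
        for pos, (r, s) in enumerate(edges):
            if pos > 0:
                string += [phi] * k                      # barrier
            string += ([phi] * (i - 1) + [("sigma", r)] + [phi] * (j - i - 1)
                       + [("sigma", s)] + [phi] * (k - j) + ["#"])
        result.append(string)
    return result

def has_match(s, t, d):
    """True if some length-|s| window of t is within Hamming distance d of s."""
    L = len(s)
    return any(sum(a != b for a, b in zip(s, t[p:p + L])) <= d
               for p in range(len(t) - L + 1))

def all_solutions(n, edges, k):
    """Every length-(k+1) string that has a match in all choice strings."""
    strings = choice_strings(edges, k)
    alphabet = ([("sigma", v) for v in range(1, n + 1)]
                + [("phi", x) for x in range(1, len(strings) + 1)] + ["#"])
    return [s for s in product(alphabet, repeat=k + 1)
            if all(has_match(s, t, k - 2) for t in strings)]
```

With the edges $(v_1, v_3), (v_1, v_4), (v_2, v_3), (v_3, v_4)$ and $k = 3$, the only string returned corresponds to $\sigma_1 \sigma_3 \sigma_4 \#$, i.e., to the single triangle of the graph.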



4 Closest Substring: Binary Alphabet 



We modify the reduction from Section 3 to achieve a Closest Substring instance with binary
alphabet, proving a W[1]-hardness result also in this case. In contrast to the previous construction,
we cannot encode every vertex with its own symbol and we cannot use a unique symbol for every
produced string. Also, we have to find new ways to "synchronize" the matches of our solution, a
task previously done by the synchronizing symbol #. To overcome these problems, we construct an
additional "complement string" for the input instance and we lengthen the blocks in the produced
choice strings considerably.



4.1 Reduction of Clique to Closest Substring 



Number strings. To encode integers between 1 and $n$, we introduce number strings $\langle\mathrm{number}(pos)\rangle$,
which have length $n$ and which have symbol "1" at position $pos$ and symbol "0" elsewhere:
$\langle\mathrm{number}(pos)\rangle := 0^{pos-1}\, 1\, 0^{n-pos}$. In contrast to the reduction from Section 3, now we use these number strings
to encode the vertices of a graph.

Choice strings. As in Section 3, we generate a set of $\binom{k}{2}$ choice strings $S_c = \{c_{1,2}, c_{1,3}, \ldots, c_{k-1,k}\}$.
Again, every choice string will consist of $m$ blocks, one block for every edge of the graph. The
choice string $c_{i,j}$ is given by

$c_{i,j} := \langle\mathrm{block}(i, j, e_1)\rangle \langle\mathrm{block}(i, j, e_2)\rangle \cdots \langle\mathrm{block}(i, j, e_m)\rangle,$

where $e_1, e_2, \ldots, e_m$ are the edges of the input graph and $\langle\mathrm{block}(\cdot)\rangle$ is defined below. The length of
a closest substring will be exactly the length of one block.

Block in a choice string. Every block consists of a front tag, an encoding part, and a back
tag. A block in choice string $c_{i,j}$ encodes an edge $e$; let $e$ be an edge connecting vertices $v_r$ and $v_s$,
$1 \le r < s \le n$, and let $c_{i,j}$ be the (according to the given order) $i'$th string in $S_c$. Then, the
corresponding block is given by

$\langle\mathrm{block}(i, j, (v_r, v_s))\rangle := \langle\mathrm{front\_tag}\rangle \langle\mathrm{encode}(i, j, (v_r, v_s))\rangle \langle\mathrm{back\_tag}(i')\rangle.$



Front tags. We want to enforce that a closest substring can only match substrings at certain
positions in the produced choice strings, using front tags:

$\langle\mathrm{front\_tag}\rangle := (1^{3nk}\, 0)^{nk},$

i.e., a front tag has length $(3nk + 1) \cdot nk$. By this arrangement, the closest substring $s$ and every
match of $s$ start (as will be shown in Subsection 4.2) with the front tag.

Encoding part. The encoding part consists of $k$ sections, each of length $n$. The encoding part
corresponds to the blocks used in Section 3. As a consequence, in $\langle\mathrm{block}(i, j, e)\rangle$ the $i$th and $j$th
section are called active and encode edge $e = (v_r, v_s)$, $1 \le r < s \le n$; section $i$ encodes $v_r$
by $\langle\mathrm{number}(r)\rangle$ and section $j$ encodes $v_s$ by $\langle\mathrm{number}(s)\rangle$. The other sections except for $i$ and $j$ are
called inactive and are given by $\langle\mathrm{inactive}\rangle := 0^{n}$. Thus,

$\langle\mathrm{encode}(i, j, (v_r, v_s))\rangle := \langle\mathrm{inactive}\rangle^{i-1}\, \langle\mathrm{number}(r)\rangle\, \langle\mathrm{inactive}\rangle^{j-i-1}\, \langle\mathrm{number}(s)\rangle\, \langle\mathrm{inactive}\rangle^{k-j}.$



Back tag. The back tag of a block is intended to balance the Hamming distance of the closest
substring to a block, as will be explained later. The back tag consists of $\binom{k}{2}$ sections, each section
has length $nk - 2k + 2$. The $i'$th section consists of symbols "1," all other sections consist of
symbols "0":

$\langle\mathrm{back\_tag}(i')\rangle := 0^{(i'-1)(nk-2k+2)}\, 1^{nk-2k+2}\, 0^{(\binom{k}{2}-i')(nk-2k+2)}.$

Template string. The set of choice strings is complemented by one template string. It consists,
in analogy to the blocks in the choice strings, of three parts: A front tag of length $(3nk + 1) \cdot
nk$, followed by a length-$nk$ string of symbols "1," followed by a length-$\binom{k}{2}(nk - 2k + 2)$ string
of symbols "0." Thus, the template string has the same length as a block in a choice string,
i.e., $(3nk + 1) \cdot nk + nk + \binom{k}{2}(nk - 2k + 2)$.

Values for d and L. We set $L := (3nk+1) \cdot nk + nk + \binom{k}{2}(nk - 2k + 2)$ and $d := nk - k$. As we
will show in Subsection 4.2, the possible matches for a string of this length are the blocks in the
choice strings, and, concerning the template string, the template string itself.
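
To make the parameter choice concrete, here is a small Python sketch that computes $L$ and $d$ and builds the template string. The function names are hypothetical; the formulas are those stated in the construction.

```python
from math import comb


def instance_parameters(n: int, k: int) -> tuple[int, int]:
    """Return (L, d) as set in the construction."""
    L = (3 * n * k + 1) * n * k + n * k + comb(k, 2) * (n * k - 2 * k + 2)
    d = n * k - k
    return L, d


def template_string(n: int, k: int) -> str:
    """Front tag, then nk ones, then binom(k,2) * (nk - 2k + 2) zeros."""
    front = ("1" * (3 * n * k) + "0") * (n * k)
    return front + "1" * (n * k) + "0" * (comb(k, 2) * (n * k - 2 * k + 2))


L, d = instance_parameters(4, 3)
t = template_string(4, 3)
```

For $n = 4$ and $k = 3$ this gives $L = 444 + 12 + 24 = 480$ and $d = 9$, and the template string has exactly length $L$.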

Notation. For a closest substring $s$, we denote its first $(3nk+1) \cdot nk$ symbols (the front tag)
by $s'$, the following $nk$ symbols (its encoding part) by $s''$, and the last $\binom{k}{2}(nk - 2k + 2)$ symbols
(its back tag) by $s'''$. Analogously, the three parts of the template string $t$ are denoted $t'$, $t''$, and
$t'''$. A particular block of a choice string $c_{i,j}$ is referred to by $s_{i,j}$; its three parts are called
$s'_{i,j}$, $s''_{i,j}$, and $s'''_{i,j}$.

Example 2. Let $G = (V, E)$ be the graph from Example 1, with $V = \{v_1, v_2, v_3, v_4\}$ and
$E = \{\{v_1,v_3\}, \{v_1,v_4\}, \{v_2,v_3\}, \{v_3,v_4\}\}$ (as shown in Fig. 1(a)), and let $k = 3$. In the following, we
outline the above construction of $\binom{k}{2} = 3$ choice strings $c_1$, $c_2$, and $c_3$ and one template string $t$
over alphabet $\Sigma = \{0, 1\}$ as displayed in Fig. 2.

Every string $c_1$, $c_2$, and $c_3$ consists of four blocks corresponding to the four edges of $G$. Fig. 2(a)
displays the first block of $c_1$, corresponding to edge $\{v_1,v_3\}$. It consists of a front tag, an encoding
part, and a back tag. The front tag (not displayed in detail in the figure) is given by
$\langle\mathrm{front\_tag}\rangle := (1^{3nk}0)^{nk} = (1^{36}0)^{12}$; the front tags for all blocks in all constructed strings are the same. The back
tag of the first block consists of $\binom{k}{2}$ sections; since the back tag is in the first string, the first section
is filled with "1"s and the remaining sections are filled with "0"s. Thus, the back tag is given by
$1^{nk-2k+2}\,0^{(\binom{k}{2}-1)(nk-2k+2)} = 1^{8}0^{16}$; the back tags for all blocks in the first string are given like this.

The encoding part consists of $k = 3$ sections, each section of length $n = 4$. In the blocks of string $c_1$,
the first and the second section are active; in the first block they encode edge $\{v_1,v_3\}$. Therefore,
the first section is given by $\langle\mathrm{number}(1)\rangle$ and the second one by $\langle\mathrm{number}(3)\rangle$; the remaining inactive
section is filled with "0"s.

Figure 2: Example for the reduction from the Clique instance $G$ (shown in Fig. 1(a)) to a
Closest Substring instance with binary alphabet as explained in Example 2. When displaying
the strings, we omit the details of the front tag parts and only indicate them shortened in
proportion to the other parts of the strings; all front tag parts in all strings are equal. In the
encoding parts and the back tag parts, we indicate the symbols "1" of the construction by dark
boxes, the symbols "0" by white boxes. In (a), we outline the first block of $c_1$. In its encoding part,
sections 1 and 2 (sections are indicated by bold separating lines) are active (indicated by dashed
boxes) and encode the first edge $\{v_1,v_3\}$ of graph $G$; the remaining third section is inactive. In
its back tag part, the first section is filled with symbols "1." In (b), we give an overview of all
constructed strings, the choice strings $c_1$, $c_2$, and $c_3$, and the template string $t$. We also display the
closest substring $s$ that is found, since $G$ has a clique of size $k = 3$; its matches in $c_1$, $c_2$, $c_3$, and $t$
are indicated by dashed boxes. In (c), we focus on these matches and the solution string $s$, and
state, separately for the front tag, the encoding, and the back tag part, the Hamming distances of
$s$ to a match $s_i$, $i = 1, 2, 3$ (the distances are equal for $s_1$, $s_2$, and $s_3$) and to the template string $t$.

Fig. 2(b) displays an overview of all constructed strings $c_1$, $c_2$, $c_3$, and $t$. In all choice strings, block $i$
encodes the $i$th edge, $1 \le i \le 4$. However, the active sections of the encoding part and the back
tags differ for different strings. The template string $t$ consists of only one block, which has a front
tag, a part corresponding to the encoding part, filled with "1"s, and a part corresponding to the
back tag, filled with "0"s.

Since $G$ has a $k$-clique for $k = 3$, consisting of vertices $v_1$, $v_3$, and $v_4$, we find a solution $s$ for
the constructed Closest Substring instance. This $s$ has a front tag, and its back tag part
is filled with "0" symbols. The encoding part encodes the vertices of the clique; it is given by
$\langle\mathrm{number}(1)\rangle\langle\mathrm{number}(3)\rangle\langle\mathrm{number}(4)\rangle$.

Fig. 2(c) focuses on the matches that are found in $c_1$, $c_2$, $c_3$, and $t$, which are, for the choice
strings, referred to by $s_1$, $s_2$, and $s_3$, respectively. The front tag part $s'$ has distance 0 to the front
tags $s'_1$, $s'_2$, $s'_3$, and $t'$. The encoding part $s''$ contains $k = 3$ many "1"s; $s''_1$, $s''_2$, $s''_3$ have two "1"s
each and, in each case, these "1"s coincide with "1"s in $s''$. Therefore, $d_H(s'', s''_i) = k - 2 = 1$,
$1 \le i \le 3$. The encoding part of the template string, $t''$, consists only of "1"s and, therefore,
$d_H(s'', t'') = nk - k$. The back tag $s'''$ consists only of "0"s; each of the back tags $s'''_1$, $s'''_2$, and $s'''_3$
contains $nk - 2k + 2 = 8$ many "1"s; therefore $d_H(s''', s'''_i) = 8$, $1 \le i \le 3$. The back tag of the
template string, $t'''$, contains only "0"s and, hence, $d_H(s''', t''') = 0$. Altogether, this shows that,
for $1 \le i \le 3$, $d_H(s, s_i) = d_H(s, t) = nk - k = 9$ as required.
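
The part-wise distances of Example 2 can be recomputed mechanically. The following Python sketch is an illustration under two assumptions: $\langle\mathrm{number}(r)\rangle$ is the length-$n$ string with a single "1" at position $r$, and the choice strings $c_1, c_2, c_3$ correspond to the section pairs $(1,2)$, $(1,3)$, $(2,3)$ in this order. The front tags are omitted because they coincide in all strings and contribute distance 0.

```python
from itertools import combinations
from math import comb


def hamming(a: str, b: str) -> int:
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def number(r: int, n: int) -> str:
    return "0" * (r - 1) + "1" + "0" * (n - r)


n, k = 4, 3
clique = [1, 3, 4]                       # vertices v_1, v_3, v_4
sec = n * k - 2 * k + 2                  # back tag section length, here 8

s_enc = "".join(number(h, n) for h in clique)
s_back = "0" * (comb(k, 2) * sec)

# (encoding distance, back tag distance) for the matching block of each c_{i,j}.
match_distances = []
pairs = list(combinations(range(1, k + 1), 2))   # (1,2), (1,3), (2,3)
for index, (i, j) in enumerate(pairs, start=1):
    sections = ["0" * n] * k
    sections[i - 1] = number(clique[i - 1], n)
    sections[j - 1] = number(clique[j - 1], n)
    block_enc = "".join(sections)
    block_back = ("0" * ((index - 1) * sec) + "1" * sec
                  + "0" * ((len(pairs) - index) * sec))
    match_distances.append((hamming(s_enc, block_enc), hamming(s_back, block_back)))

template_distances = (hamming(s_enc, "1" * (n * k)), hamming(s_back, "0" * len(s_back)))
```

Each entry of `match_distances` is `(1, 8)` and `template_distances` is `(9, 0)`, so every total equals $nk - k = 9$.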

4.2 Correctness of the Reduction 

To prove the correctness of the reduction, again the easier direction is to show that a $k$-clique
implies a closest substring fulfilling the given requirements.

Proposition 3. For a graph with a k-clique, the construction in Subsection 4.1 produces an
instance of Closest Substring that has a solution, i.e., there is a string $s$ of length $L$ such that
every $c_{i,j} \in S_c$ has a length $L$ substring $s_{i,j}$ with $d_H(s, s_{i,j}) \le d$ and $d_H(s, t) \le d$.

Proof. Let the graph have a clique of size $k$. Let $h_1, h_2, \ldots, h_k$ denote the indices of the clique's
vertices, $1 \le h_1 < h_2 < \ldots < h_k \le n$. Then, we can find a closest substring $s$, consisting of
three parts $s'$, $s''$, and $s'''$, as follows: its front tag $s'$ is given by $\langle\mathrm{front\_tag}\rangle$; its encoding part $s''$
is given by $\langle\mathrm{number}(h_1)\rangle\langle\mathrm{number}(h_2)\rangle \cdots \langle\mathrm{number}(h_k)\rangle$; its back tag $s'''$ is $0^{\binom{k}{2}(nk-2k+2)}$. It follows
from the construction that the choice strings have substrings that are matches for this $s$: for every
$1 \le i < j \le k$, we produced choice string $c_{i,j}$ with a block $s_{i,j}$ encoding the edge between vertices $v_{h_i}$
and $v_{h_j}$. For these blocks as well as for the template string, the following table reports the distance
they have to the solution string, separately for each of their three parts and in total:





                                          s'     s''       s'''           s
match s_{i,j} in choice string c_{i,j}    0      k - 2     nk - 2k + 2    nk - k
template string t                         0      nk - k    0              nk - k



As is obvious from these distance values, the indicated substrings in the choice strings all have
Hamming distance $d = nk - k$ to the solution string and, therefore, are matches for $s$. □
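
The forward direction can also be checked by brute force on a small instance. The sketch below is illustrative and rests on assumptions not stated in this section: $\langle\mathrm{number}(r)\rangle$ is taken to be the length-$n$ string with a single "1" at position $r$, the choice strings are ordered lexicographically by their section pairs $(i,j)$, and all function names are hypothetical. It builds the instance for the graph of Example 2, constructs $s$ from the clique, and scans all length-$L$ substrings.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def build_instance(n, edges, k):
    """Choice strings (pairs (i, j) in lexicographic order, an assumption),
    template string, and distance bound d for a graph on vertices 1..n."""
    sec = n * k - 2 * k + 2
    front = ("1" * (3 * n * k) + "0") * (n * k)
    pairs = list(combinations(range(1, k + 1), 2))

    def number(r):
        return "0" * (r - 1) + "1" + "0" * (n - r)

    choice = []
    for index, (i, j) in enumerate(pairs, start=1):
        back = ("0" * ((index - 1) * sec) + "1" * sec
                + "0" * ((len(pairs) - index) * sec))
        blocks = []
        for r, s in edges:                      # each edge given with r < s
            enc = ("0" * (n * (i - 1)) + number(r) + "0" * (n * (j - i - 1))
                   + number(s) + "0" * (n * (k - j)))
            blocks.append(front + enc + back)
        choice.append("".join(blocks))
    template = front + "1" * (n * k) + "0" * (len(pairs) * sec)
    return choice, template, n * k - k


def solution_from_clique(n, k, clique):
    """The string s from the proof: front tag, encoded clique, all-zero back tag."""
    front = ("1" * (3 * n * k) + "0") * (n * k)
    enc = "".join("0" * (h - 1) + "1" + "0" * (n - h) for h in sorted(clique))
    pair_count = k * (k - 1) // 2
    return front + enc + "0" * (pair_count * (n * k - 2 * k + 2))


def best_distance(s, text):
    """Smallest Hamming distance between s and any length-|s| substring of text."""
    return min(hamming(s, text[p:p + len(s)]) for p in range(len(text) - len(s) + 1))


edges = [(1, 3), (1, 4), (2, 3), (3, 4)]        # graph of Example 2
choice, template, d = build_instance(4, edges, 3)
s = solution_from_clique(4, 3, [1, 3, 4])
distances = [best_distance(s, c) for c in choice] + [hamming(s, template)]
```

All four entries of `distances` equal $d = 9$, in agreement with the table. Such a scan only confirms that a given $s$ is a solution; it does not search for solutions.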

For the reverse direction, we assume that the Closest Substring instance has a solution. We
need the following statements: 

Lemma 3. A solution s and all its matches in the input instance start with the front tag. 

Proof. Since $s$ is of length $L = (3nk+1) \cdot nk + nk + \binom{k}{2}(nk - 2k + 2)$, the only possible match in the
template string is the template string itself. Therefore, $s'$ can differ from $t'$ in at most $d = nk - k$
symbols. We can show that the only substrings in a choice string $c_{i,j}$ that are possible matches
for $s$ with Hamming distance at most $d$ start with the front tag, as we argue in the following.

Since $s$ is a solution, there is a match in $c_{i,j}$, and we denote it by $s_{i,j}$. Denote the first $(3nk+1) \cdot nk$
symbols of $s_{i,j}$ by $s'_{i,j}$. Since $d_H(s', s'_{i,j}) \le nk - k$ and $d_H(s', t') \le nk - k$, we necessarily (triangle
inequality for the Hamming metric) have $d_H(s'_{i,j}, t') \le 2(nk - k)$. We show that this is only possible
when $s'_{i,j}$ coincides with a front tag of a block of $c_{i,j}$. Assuming that it does not, we will show that
$d_H(s'_{i,j}, t') > 2(nk - k)$, a contradiction.

Firstly, assume that the starting position of $s'_{i,j}$ and the starting position of a front tag in $c_{i,j}$ differ
by $p$ positions, $1 \le p \le 3nk$. Then, at least $nk - 1$ symbols "0" of $t'$ are aligned with symbols "1"
of the front tag in $s'_{i,j}$ and at least $nk - 1$ symbols "1" of $t'$ are aligned with symbols "0" of $s'_{i,j}$.
This implies $d_H(s'_{i,j}, t') \ge 2nk - 2$, which is greater than $2(nk - k)$ for $k > 1$. Secondly, assume that the
starting position of $s'_{i,j}$ and the starting position of its closest front tag in $c_{i,j}$ differ by $p > 3nk$
positions. Then, a block of $3nk$ symbols "1" falls onto the encoding and/or the back tag part of $s_{i,j}$.
Since the encoding part and back tag together contain only $2 + (nk - 2k + 2) < nk$ (under the assumption
that $k > 2$) many symbols "1", we have more than $2nk$ mismatching symbols and $d_H(s'_{i,j}, t') > 2(nk - k)$.

Summarizing, we conclude that $s'_{i,j}$ coincides with a front tag in choice string $c_{i,j}$, i.e.,
$s'_{i,j} = t' = s' = \langle\mathrm{front\_tag}\rangle$. □
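
The effect of a misaligned window can be observed numerically. The following Python sketch is a small experiment for $n = 4$, $k = 3$, not a proof: it takes two consecutive copies of one block (the block contents are those of Example 2 and are otherwise an arbitrary illustrative choice) and measures the Hamming distance between $t'$ and every window that starts at a non-zero offset from a front tag.

```python
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


n, k = 4, 3
front = ("1" * (3 * n * k) + "0") * (n * k)      # this is also t'
# One block as in Example 2: edge {v_1, v_3}, active sections 1 and 2, back tag 1.
encoding = "1000" + "0010" + "0000"
back = "1" * 8 + "0" * 16
block = front + encoding + back
text = block + block                              # two consecutive blocks

# Distance between t' and the length-|t'| window starting p positions after a
# front tag, for every offset p that is not a block boundary.
shifted = {p: hamming(front, text[p:p + len(front)]) for p in range(1, len(block))}
smallest = min(shifted.values())
threshold = 2 * (n * k - k)                       # 2(nk - k) = 18
```

In this experiment `smallest` exceeds `threshold`, and the offsets $1 \le p \le 3nk$ all give at least $2nk - 2$ mismatches, as the first case of the proof states.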

Lemma 4. The encoding part of $s$ contains exactly $k$ symbols "1".

Proof. Assume that $s$ has fewer than $k$ symbols "1" in its encoding part, i.e., $s''$ contains fewer than $k$
symbols "1". Then, because $t'' = 1^{nk}$, $d_H(s'', t'') \ge nk - k + 1$, implying $d_H(s, t) \ge nk - k + 1$, a
contradiction.

Assume that $s$ has more than $k$ "1" symbols in its encoding part $s''$. Then, $d_H(s'', s''_{i,j}) > k - 2$
for the encoding part $s''_{i,j}$ of a match in every choice string $c_{i,j}$. Now consider the solution's back
tag $s'''$. To achieve $d_H(s, s_{i,j}) \le nk - k$, we need $d_H(s''', s'''_{i,j}) < nk - 2k + 2$ and $s'''$ must contain
one or more symbols "1". Every "1" symbol in $s'''$ will decrease the value $d_H(s, s_{i,j})$ for a block $s_{i,j}$
of one choice string $c_{i,j}$ by one, but will increase the solution's Hamming distance to the selected
blocks of all other choice strings. No matter how many "1" symbols we have in the back tag of $s$,
there will always be a choice string $c_{i,j}$ with $d_H(s''', s'''_{i,j}) \ge nk - 2k + 2$. In summary, we will always
have a choice string $c_{i,j}$ with $d_H(s, s_{i,j}) = d_H(s'', s''_{i,j}) + d_H(s''', s'''_{i,j}) > nk - k$, a contradiction. □

Lemma 5. Every section of the encoding part of s contains exactly one symbol "1". 

Proof. Assume that not every section in the encoding part of $s$ contains exactly one "1" symbol.
Then, there must be a section containing no symbol "1", since, by Lemma 4, the number of
symbols "1" in the encoding part of $s$ adds up to $k$. Let $i'$, $1 \le i' \le k$, be the section containing
no symbol "1". W.l.o.g., consider a choice string $c_{i',j}$, $i' < j \le k$, or, if $i' = k$, a choice string $c_{j,i'}$,
$1 \le j < i'$. In every block $s_{i',j}$ of $c_{i',j}$, sections $i'$ and $j$ of the encoding part are active and, therefore,
contain exactly one symbol "1" each; these are the only symbols "1" in $s''_{i',j}$. Now consider the $k$
symbols "1" in the encoding part of $s$: the "1"s in all sections of $s''$ except for section $j$ are all
aligned with "0"s in $s''_{i',j}$; within section $j$, only a single "1" of $s''$ can be matched to a "1" of $s''_{i',j}$.
Therefore, $d_H(s'', s''_{i',j}) > k - 2$. As in the proof of Lemma 4, we conclude that $s$ is no solution. □

Proposition 4. The k symbols "1" in the solution string's encoding part correspond to a k-clique 
in the graph. 

Proof. Let $s$ be a solution for the Closest Substring instance. Summarizing, we know by
Lemma 3 that $s$ can have as a match only one of the choice string's blocks. By Lemma 5, every
section of the encoding part $s''$ contains exactly one "1" symbol; therefore, we can read this as an
encoding of $k$ vertices of the graph. Let $v_{h_1}, v_{h_2}, \ldots, v_{h_k}$ be these vertices. Further, we know that the
back tag $s'''$ consists only of "0" symbols: by Lemma 4, the encoding part $s''$ has only $k$ "1"s; if
$s'''$ contained a "1", then we would have $d_H(s, t) > nk - k$. We have $d_H(s''', s'''_{i,j}) = nk - 2k + 2$ for every
choice string match $s_{i,j}$ and, since every $s''_{i,j}$ contains only two "1" symbols, $d_H(s'', s''_{i,j}) \ge k - 2$.
Now consider some $1 \le i < j \le k$ and the corresponding choice string $c_{i,j}$. Since $s$ is a solution,
we know that there is a block $s_{i,j}$ with $d_H(s'', s''_{i,j}) = k - 2$. That means that the two "1" symbols
in $s''_{i,j}$ have to match two "1" symbols in $s''$; this implies that the two vertices $v_{h_i}$ and $v_{h_j}$ are
connected by an edge in the graph. Since this is true for all $1 \le i < j \le k$, vertices $v_{h_1}, \ldots, v_{h_k}$ are
pairwise connected by edges and form a $k$-clique. □

Propositions 3 and 4 yield the following main theorem:

Theorem 2. Closest Substring is W[1]-hard for parameter k in the case of a binary alphabet.



5 Consensus Patterns 



Our techniques for showing hardness of Closest Substring, parameterized by the number $k$
of input strings, also apply to Consensus Patterns. Because of the similarity to Closest 
Substring, we restrict ourselves to explaining the problem and pointing out new features in the 
hardness proof. 

Given strings $s_1, s_2, \ldots, s_k$ over alphabet $\Sigma$ and integers $d$ and $L$, the Consensus Patterns
problem asks whether there is a string $s$ of length $L$ such that $\sum_{i=1}^{k} d_H(s, s'_i) \le d$, where $s'_i$ is a
length $L$ substring of $s_i$. Thus, Consensus Patterns aims at minimizing the sum of errors. Since
errors are summed up over all strings, the value of $d$ will usually not be small and, therefore,
the most significant parameterization for this problem seems to be the one by $k$. The problem is
NP-complete and has a PTAS [19]. By reduction from Clique, we can show W[1]-hardness results
as for Closest Substring given unbounded alphabet size. We omit the details here and focus on
the case of binary input alphabet. We can apply basically the same ideas as were used in Section 4;
however, some modifications are necessary.
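
The difference between the two objectives (maximum versus sum of distances) can be illustrated with a few lines of Python. This is a minimal sketch that evaluates a given candidate $s$ by exhaustive scanning; the function names and the tiny instance are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def best_distance(s: str, text: str) -> int:
    """Distance from s to its closest length-|s| substring of text."""
    return min(hamming(s, text[p:p + len(s)]) for p in range(len(text) - len(s) + 1))


def closest_substring_radius(s: str, strings: list[str]) -> int:
    """Closest Substring measure: the maximum over all input strings."""
    return max(best_distance(s, text) for text in strings)


def consensus_patterns_cost(s: str, strings: list[str]) -> int:
    """Consensus Patterns measure: the sum over all input strings."""
    return sum(best_distance(s, text) for text in strings)


# Small made-up instance over the binary alphabet with L = 2.
strings = ["0011", "1110", "0000"]
candidate = "01"
radius = closest_substring_radius(candidate, strings)   # max of per-string distances
cost = consensus_patterns_cost(candidate, strings)      # sum of per-string distances
```

For this instance the per-string distances are 0, 1, and 1, so the candidate has radius 1 under the Closest Substring measure and cost 2 under the Consensus Patterns measure.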

5.1 Reduction of Clique to Consensus Patterns



Choice strings. As in Subsection 4.1, we generate a set of $\binom{k}{2}$ choice strings
$S_c = \{c_{1,2}, c_{1,3}, \ldots, c_{k-1,k}\}$ with $c_{i,j} := \langle\mathrm{block}(i,j,e_1)\rangle\langle\mathrm{block}(i,j,e_2)\rangle \cdots \langle\mathrm{block}(i,j,e_m)\rangle$, encoding the $m$ edges
of the input graph. This time, however, every block consists only of a front tag and an encoding part.
No back tag is necessary. Therefore, we use $\langle\mathrm{block}(i,j,\{v_r,v_s\})\rangle := \langle\mathrm{front\_tag}\rangle\langle\mathrm{encode}(i,j,\{v_r,v_s\})\rangle$,
in which the encoding part $\langle\mathrm{encode}(i,j,\{v_r,v_s\})\rangle$ is constructed as in Subsection 4.1. Before we
explain the front tags, we fix the distance value $d$.

Distance Value. We set the distance value $d := \left(\binom{k}{2} - (k-1)\right) nk$.

Front tags. The front tag is now given by (i"fc^o)"'^^0"^^. Thus, the front tag has length n'^k^ + 
2nk^. The front tag here is more complex than the one used in Subsection 4.1, for the following
reason. Its purpose is to make sure that a substring which is not a block cannot be a match. To
achieve this, the front tag lets such an unwanted substring necessarily have a distance value larger
than $d$ to a possible solution (as explained in the proof of Lemma 3). Since $d$ has a higher value
here compared to Section 4, we need the more complex front tag.

Solution length. We set the substring length to the length of one block, i.e., the sum of n^k^ + 2nk^
(the length of the front tag) and nk (the length of the encoding part). Therefore, L := n^k^ +
2nk^ + nk.



Template strings. In contrast to Subsection 4.1, we produce not only one but $\binom{k}{2} - (k-1)$ many
template strings. All template strings have length $L$, i.e., the length of one block. Each template
string is a concatenation of the front tag part (as given above) and an encoding part consisting
of $nk$ many symbols "1".



In summary, the front tag ensures that only a block of a choice string can be selected as a substring
matching a solution. Regarding the distribution of mismatches, we note that a closest substring's
front tag part will not cause any mismatches. In its encoding part, each of its $nk$ positions causes
at least $\binom{k}{2} - (k-1)$ mismatches. It causes exactly $\binom{k}{2} - (k-1)$ mismatches for every position iff
the input graph contains a $k$-clique.



5.2 Correctness of the Reduction 

Proposition 5. For a graph with a k-clique, the construction in Subsection 5.1 produces an
instance of Consensus Patterns which has a solution, i.e., there is a string $s$ of length $L$ such that
every $c_{i,j}$, $1 \le i < j \le k$, has a substring $s_{i,j}$ with $\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} d_H(s, s_{i,j}) \le d$.

Proof. Given an undirected graph $G$ with $n$ vertices and $m$ edges, let $1 \le h_1 < h_2 < \cdots < h_k \le n$ be
the indices of the $k$-clique's vertices. Then, let string $s$ consist of the front tag described in the above
construction, concatenated with the encoding part $\langle\mathrm{number}(h_1)\rangle\langle\mathrm{number}(h_2)\rangle \cdots \langle\mathrm{number}(h_k)\rangle$,
which encodes all clique vertices. For every $1 \le i < j \le k$, we choose in choice string $c_{i,j}$ the
block $s_{i,j}$ encoding the edge connecting vertices $v_{h_i}$ and $v_{h_j}$. We will show that these blocks, together
with the template strings, have total Hamming distance exactly $\left(\binom{k}{2} - (k-1)\right) nk$ to $s$.






The front tags of $s$ and of each $s_{i,j}$ coincide, so their Hamming distance is 0. Recall from Subsection 4.1
that the encoding parts consist of $k$ sections, each section of length $n$. We consider the encoding
parts section by section and, within a section, columnwise. Given a section $i'$, $1 \le i' \le k$, there
are $k - 1$ choice strings in which this section is active, and this section in these blocks encodes
vertex $v_{h_{i'}}$. Consider the column at position $h_{i'}$ in this section, over all selected substrings and all
template strings. We have $\binom{k}{2} - (k-1)$ "0" symbols from the choice strings in which this section
is inactive; in all other strings, there is a "1" at this position. In $s$, this position is "1," causing
$\binom{k}{2} - (k-1)$ mismatches. Now consider the remaining columns of section $i'$. In each of them, we
have $\binom{k}{2} - (k-1)$ "1" symbols from the template strings; all $\binom{k}{2}$ choice strings have "0" at the
corresponding position. In $s$, this position is "0," causing $\binom{k}{2} - (k-1)$ mismatches. Thus, we have
$\binom{k}{2} - (k-1)$ mismatches at each of the $n$ positions within a section, and this is true for all $k$
sections of the encoding part. The sum of distances from $s$ to the matches in choice strings and
the template strings is $\left(\binom{k}{2} - (k-1)\right) nk$; $s$ is a solution. □
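
The columnwise counting argument can be replayed in code. The sketch below is illustrative: it considers only the encoding parts (the front tags contribute no mismatches), assumes that $\langle\mathrm{number}(r)\rangle$ is the length-$n$ string with a single "1" at position $r$, and uses the hypothetical function name `column_mismatches`.

```python
from itertools import combinations
from math import comb


def column_mismatches(n: int, k: int, clique: list[int]) -> list[int]:
    """Mismatches per encoding-part column between s and all selected strings."""
    def number(r):
        return "0" * (r - 1) + "1" + "0" * (n - r)

    clique = sorted(clique)
    s_enc = "".join(number(h) for h in clique)
    rows = []
    # Encoding part of the selected block in each choice string c_{i,j}.
    for i, j in combinations(range(1, k + 1), 2):
        sections = ["0" * n] * k
        sections[i - 1] = number(clique[i - 1])
        sections[j - 1] = number(clique[j - 1])
        rows.append("".join(sections))
    # Encoding parts of the binom(k,2) - (k-1) template strings.
    rows += ["1" * (n * k)] * (comb(k, 2) - (k - 1))
    return [sum(row[p] != s_enc[p] for row in rows) for p in range(n * k)]


per_column = column_mismatches(n=5, k=4, clique=[1, 2, 4, 5])
expected = comb(4, 2) - (4 - 1)          # 3 mismatches per column
```

Every entry of `per_column` equals `expected`, so the total is $\left(\binom{k}{2} - (k-1)\right) nk = 60$ for $n = 5$ and $k = 4$.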



For the reverse direction, we use two lemmas to show important properties that a solution of the
constructed instance has. The first lemma is proved in analogy to Lemma 3.

Lemma 6. A solution s and all its matches in the input instance start with the front tag. □ 



The second property of a solution, although also valid for the solutions in Subsection 4.2, is
established in a different way here. It relies on the additional template strings that have been introduced
in the construction of the Consensus Patterns instance.

Lemma 7. A solution $s$ contains exactly one symbol "1" in every section of its encoding part.



Proof. Let $s$ be a solution for the constructed Consensus Patterns instance. By Lemma 6, we
know that $s$ and all its matches in the choice strings start with the front tag. Consequently, the
matches in the choice strings must be blocks.

Consider the encoding part of a solution $s$ together with the encoding parts of its matches in the
input strings. We note that we have at least $\binom{k}{2} - (k-1)$ mismatches for every column at position $p$,
$1 \le p \le nk$: on the one hand, all $\binom{k}{2} - (k-1)$ template strings have "1" symbols at position $p$.
On the other hand, all $\binom{k}{2} - (k-1)$ choice strings in which position $p$'s section is inactive have "0"
at this position, no matter which blocks we chose in these choice strings. Since $s$ is a solution and
only a total of $\left(\binom{k}{2} - (k-1)\right) nk$ mismatches are allowed, we have exactly $\binom{k}{2} - (k-1)$ mismatches
for every position of the encoding part of $s$ with the corresponding positions in the matches of $s$.

Now, consider an arbitrary section $i'$, $1 \le i' \le k$, and consider all $k - 1$ choice strings in which
section $i'$ is active. In these choice strings, section $i'$ contains exactly one "1" symbol. We will
show that in these choice strings' blocks that form the matches for $s$, the "1" in section $i'$ must
be at the same position in all matches, because, otherwise, $s$ is no solution. Assume that we chose
blocks in which the "1" symbols of section $i'$ are at different positions. We can easily check that
this would cause more than $\binom{k}{2} - (k-1)$ mismatches for the columns corresponding to the positions
of the "1" symbols; this contradicts the assumption that $s$ is a solution. We conclude that, for all
matches in choice strings, the "1" symbols of section $i'$ must be at the same position. For columns
in which we have "1" symbols in choice strings, there is a majority of "1" symbols, namely those in
the $k - 1$ choice strings in which section $i'$ is active and those in the $\binom{k}{2} - (k-1)$ template strings.
Therefore, the respective position in $s$ must be "1." For all other columns, there is a majority of
"0" symbols, namely those in all $\binom{k}{2}$ choice strings. Therefore, the respective position in $s$ must
be "0." □



These two lemmas allow us to show that also the reverse direction of the reduction is correct. 

Proposition 6. The k symbols "1" in the solution string's encoding part correspond to a k-clique 
in the graph. 

Proof. Let $s$ be a solution for the constructed Consensus Patterns instance. By Lemma 7, every
section in the encoding part of $s$ encodes a vertex of the input graph. In the following, we show
that all encoded vertices are pairwise connected by edges.

Let $V_c = \{v_{h_1}, v_{h_2}, \ldots, v_{h_k}\}$ be the vertices encoded in the solution's encoding part. For every
two sections $1 \le i < j \le k$, we select in choice string $c_{i,j}$ a substring in which the "1" symbols of
sections $i$ and $j$ are at the same positions as the "1" symbols of sections $i$ and $j$ in the solution:
selecting another substring would result in a Hamming distance greater than $\binom{k}{2} - (k-1)$ in the
$h_i$th and $h_j$th columns, and $s$ could not be a solution. Hence, the selected block encodes the edge
connecting $v_{h_i}$ and $v_{h_j}$. Since we find such a substring for every $1 \le i < j \le k$, every pair of
vertices in $V_c$ is connected by an edge; $V_c$ is a $k$-clique. □

Propositions 5 and 6 yield the following main result.

Theorem 3. Consensus Patterns is W[1]-hard for parameter k in case of a binary alphabet.



6 Conclusion 



We have proven that Closest Substring and Consensus Patterns, parameterized by the
number $k$ of input strings and with alphabet size two, are W[1]-hard. This contrasts with related
sequence analysis problems, such as Longest Common Subsequence M, 0] and Shortest Common
Supersequence [17], where, until now, parameterized hardness has only been established
in the case of unbounded alphabet size. Now, it is also known that these problems, parameterized
by the number of input strings, are W[1]-hard in case of bounded alphabet size [^^. In our
opinion, however, intuitively speaking, our W[1]-hardness result for Consensus Patterns is the
most surprising one in this context, because Consensus Patterns seems to carry significantly
less combinatorial structure than the other problems.

The parameterized complexity of Closest Substring and Consensus Patterns, parameterized
by the "distance parameter" $d$, remains open for alphabets of constant size. If these problems are also
W[1]-hard, then an efficient and practically useful PTAS would appear to be impossible |^, p^ ,
unless further structure of natural input distributions is taken into account in a more complex
aggregate parameterization of these basic computational problems of bioinformatics.